Recap

  • What is web scraping?
  • rvest and polite
  • What is a function?
  • File paths and RStudio projects

Outline

  1. Regular expression
  2. Why do we want to analyze text data?
  3. Steps for text analysis
  4. R packages for text analysis
  5. Tidy text
  6. Stop words
  7. Sentiment of the text
  8. Word Importance

Regular Expression

Regular Expression

Regular expressions provide a concise and flexible way to define patterns in strings.

At their most basic level, they can be used to match a fixed string, which may appear one or more times within a single string.

str_view()

We will use stringr::str_view() to demonstrate various regular expression syntax.

The str_view() function highlights matching patterns by enclosing them in <>.
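
For instance, a minimal sketch with made-up strings:

```r
library(stringr)

# Highlight every match of the fixed pattern "an".
x <- c("banana", "apple", "pecan")
str_view(x, "an")
```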

Metacharacter

Metacharacters have special meaning in regular expressions.

Common Metacharacters (1/3)

  • .: match any character except for \n
  • ^: match the starting position within the string
  • $: match the ending position of the string
  • |: match the expression before or the expression after the operator
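
These can be sketched with str_view() on made-up strings:

```r
library(stringr)

fruits <- c("apple", "banana", "pear")
str_view(fruits, "a.")         # "a" followed by any character
str_view(fruits, "^a")         # strings starting with "a"
str_view(fruits, "a$")         # strings ending with "a"
str_view(fruits, "apple|pear") # either alternative
```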

Metacharacter

Quantifiers control how many times a pattern matches.

Common Metacharacters (2/3)

  • *: match the preceding element zero or more times
  • {m,n}: match the preceding element at least \(m\) and not more than \(n\) times
  • ?: match the preceding element zero or one time
  • +: match the preceding element one or more times
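
A sketch of the quantifiers on toy strings:

```r
library(stringr)

x <- c("a", "ab", "abb", "abbb")
str_view(x, "ab*")      # "a" then zero or more "b"
str_view(x, "ab+")      # "a" then one or more "b"
str_view(x, "ab?")      # "a" then zero or one "b"
str_view(x, "ab{2,3}")  # "a" then at least two, at most three "b"
```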

Metacharacter

A character set allows you to match any single character from a specified set.

Remember, \ must itself be escaped in R strings, so the regex \d is written as "\\d".

Common Metacharacters (3/3)

  • []: match a single character that is contained within the brackets
  • [^]: match a single character that is not contained within the brackets
  • [a-z]: match any lower case letter
  • [A-Z]: match any upper case letter
  • [0-9]: match any number
  • \d: match any digit
  • \w: match any word character (letters, digits, and underscore)
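
A few of these, sketched on made-up strings (note the doubled backslash R requires):

```r
library(stringr)

x <- c("cat", "Cat", "c4t", "c-t")
str_view(x, "[a-z]at")  # a lower-case letter followed by "at"
str_view(x, "c[^a]t")   # "c", any character except "a", then "t"
str_view(x, "c\\dt")    # "c", a digit, then "t"
```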

Grouping and Capturing

Parentheses create capturing groups, enabling you to work with specific subcomponents of a match.

Tip

You can reuse these groups in your pattern, where \1 refers to the match inside the first set of parentheses, \2 refers to the second, and so forth.

  • .* means 0 or more of any character
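
A toy sketch of a backreference, where (..) captures any two characters and \1 repeats them:

```r
library(stringr)

fruits <- c("banana", "coconut", "papaya", "apple")
# Matches any pair of characters that occurs twice in a row,
# e.g. "anan" in banana, "coco" in coconut, "papa" in papaya.
str_view(fruits, "(..)\\1")
```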

Text Analysis

What is Text Analysis?

Text analysis is a set of techniques that enable data analysts to extract and quantify information stored in the text, whether it’s from messages, tweets, emails, books, or other sources.

For example:

  • Predicting Melbourne house prices based on realtor descriptions.
  • Gauging public discontent with Melbourne train stoppages using Twitter data.
  • Identifying differences between the first and sixth editions of Darwin’s On the Origin of Species.
  • Analyzing and quantifying sentiment within a given text.

Text Analysis Process

We will use the tidytext package for the first three steps and the gutenbergr package to obtain text data.

  1. Import the text.
  2. Pre-process the data by removing less meaningful words, known as stop words.
  3. Tokenize the text by breaking it into words, sentences, n-grams, or chapters.
  4. Summarize the results.
  5. Apply modeling techniques.

tidytext

Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use.

Let’s start with a conversation from Game of Thrones:

What is Tidy Text Format?

Tidy text format is a table with one-token-per-row.

A token is a meaningful unit of text, such as a word, that we are interested in using for analysis.

Tokenization (unnest_tokens()) is the process of splitting text into tokens.
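
A minimal sketch of tokenization; the dialogue tibble and its speaker/text columns are made up for illustration:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up dialogue data in one-row-per-line format.
dialogue <- tibble(
  speaker = c("Tyrion", "Jon"),
  text    = c("Never forget what you are.", "Winter is coming.")
)

# One token (word) per row; text is lower-cased and punctuation stripped.
dialogue %>% unnest_tokens(word, text)
```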

Unit for Tokenization - Characters

Use characters as tokens.

Unit for Tokenization - Ngrams

N-grams are groups of consecutive words, with the group size defined by n.
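
A sketch of bigram tokenization (n = 2) on a made-up sentence:

```r
library(tibble)
library(tidytext)

sentence <- tibble(text = "winter is coming soon")

# Each token is a pair of consecutive words:
# "winter is", "is coming", "coming soon".
unnest_tokens(sentence, bigram, text, token = "ngrams", n = 2)
```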

Analyzing User Reviews for Animal Crossing: New Horizons (A Nintendo Game)

The dataset consists of user and critic reviews for Animal Crossing: New Horizons, scraped from Metacritic.

This data was sourced from a #TidyTuesday challenge.

Grade Distribution

Warning

A value of 0 could indicate missing data!

Positive Reviews

Negative Reviews

Remove the “Expand” from the Text

Long reviews are truncated by the scraping procedure, which leaves the word “Expand” at the end.

We will remove these characters from the text.
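
A sketch on a single made-up review string; the real data would apply this with mutate() across the whole text column:

```r
library(stringr)

# Made-up review ending in the scraped "Expand" suffix.
review <- "Great game, I play it every day.Expand"

# Remove "Expand" only when it occurs at the end of the string.
str_remove(review, "Expand$")
```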

Tidy up the Reviews

Use unnest_tokens() to convert the data into tidy text format.

Distribution of Words Per Review

Note

  • 58% of reviewers write fewer than 75 words, while 36% write more than 150 words.

  • Most users tend to provide brief feedback, while a smaller group of more engaged reviewers write longer, more detailed responses.

Most Common Words

Note

Certain common words, such as “the” and “a,” don’t contribute much meaning to the text.

Stop Words

In computing, stop words are words that are filtered out before or after processing natural language data (text).

These words are generally among the most common in a language, but there is no universal list of stop words used by all natural language processing tools.

While stop words often do not add meaning to the text, they do contribute to its grammatical structure.

English Stop Words

Lexicon: a word book or dictionary; here, a curated list of words.

Chinese Stop Words

Various Lexicons

See ?get_stopwords for more info.

Comparing Lexicons by the Number of Stopwords

It is perfectly acceptable to start with a pre-made word list and remove or append additional words according to your particular use case.

Remove Stopwords

You can replace filter() with an anti_join() call, but filter() makes the action clearer.
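
A minimal sketch using tidytext’s built-in stop_words table and a made-up word list:

```r
library(dplyr)
library(tidytext)

data(stop_words)  # tidytext's built-in stop word table

words <- tibble(word = c("the", "island", "a", "paradise"))

# Keep only words that are not stop words;
# anti_join(words, stop_words, by = "word") is equivalent.
words %>% filter(!word %in% stop_words$word)
```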

Note

The most common words are fitting, as the game is a popular Nintendo title for the Switch console, where players can create and play on their own island paradise with animal villagers.

Frequency of Words in User Reviews

user_reviews_words %>%
  anti_join(stopwords_smart) %>%
  count(word) %>%
  arrange(-n) %>%
  top_n(20) %>%
  ggplot(aes(fct_reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(title = "Frequency of words in user reviews",
       subtitle = "",
       y = "",
       x = "")

Let’s have a break!

Sentiment Analysis

Sentiment Lexicons

Sentiment analysis is the process of determining the emotional tone or opinion expressed in a piece of text.

It is commonly used to analyze customer feedback, reviews, and social media.

Three widely used general-purpose lexicons for sentiment analysis are:

  1. AFINN: developed by Finn Årup Nielsen, which assigns words a score ranging from -5 to 5, with negative scores reflecting negative sentiment and positive scores reflecting positive sentiment.
  2. bing: created by Bing Liu and collaborators, which classifies words into two simple categories: positive or negative.
  3. nrc: by Saif Mohammad and Peter Turney, which categorizes words into emotions such as anger, anticipation, disgust, fear, joy, sadness, surprise, and trust, in addition to positive and negative sentiment.

All three lexicons are based on unigrams (single words).

Sentiment Analysis

  • One approach to analyzing text sentiment is to treat the text as a combination of individual words.
  • The overall sentiment is determined by summing the sentiment values of the individual words.
  • However, this method can be inaccurate as it overlooks context, word order, and the impact of phrases that can change the meaning of individual words.
  • Solution: Machine learning applied to large-scale text datasets can address these challenges. You will explore these concepts in ETC3250/ETC5250 and ETC3555/ETC5555.

Sentiment Lexicons

Use get_sentiments() to retrieve the lexicons.
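
For example, retrieving the bing lexicon (bundled with tidytext; afinn and nrc are downloaded via the textdata package on first use):

```r
library(tidytext)

# A tibble with one row per word and its positive/negative label.
bing <- get_sentiments("bing")
head(bing)
```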

Sentiments in the Reviews

inner_join() returns the rows from the reviews whose word can be found in the lexicon.
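
A toy sketch of the join; the word list is made up:

```r
library(dplyr)
library(tidytext)

words <- tibble(word = c("fun", "terrible", "island"))

# Only words present in the lexicon survive the join;
# "island" carries no sentiment, so it is dropped.
words %>% inner_join(get_sentiments("bing"), by = "word")
```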

Visualizing Sentiments

user_reviews_words %>%
  inner_join(sentiments_bing) %>%
  count(sentiment, word, sort = TRUE) %>%
  arrange(desc(n)) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~sentiment, scales = "free") +
  theme_minimal() +
  labs(title = "Sentiments in user reviews", x = "") 

Average Sentiment Per Review

The average sentiment per review improves as the grade increases.

Common Words over Grades

Some common words appear in both very positive and very negative reviews, so how do we determine their importance?

Word Importance

Word Importance

How do we measure the importance of a word to a document in a collection of documents?

For example, a novel in a collection of novels, or a review in a set of reviews…

We combine the following statistics:

  • Term frequency
  • Inverse document frequency

Term Frequency

The frequency of a word \(w\) in a document \(d\), relative to the length of the document. It is a function of the word and the document.


\[ tf(w, d) = \frac{\text{count of } w \text{ in } d}{\text{total number of words in } d} \]


The term frequency for each word is the number of times that word occurs divided by the total number of words in the document.
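
The formula can be sketched on made-up counts:

```r
library(dplyr)
library(tibble)

# Toy per-document word counts; the column names are illustrative.
word_counts <- tibble(
  document = c("d1", "d1", "d1", "d2"),
  word     = c("fun", "game", "island", "game"),
  n        = c(2, 1, 1, 3)
)

# tf = count of w in d / total number of words in d.
word_counts %>%
  group_by(document) %>%
  mutate(tf = n / sum(n)) %>%
  ungroup()
```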

Term Frequency

For our reviews a document is a single user’s review. More about that here.

Inverse Document Frequency

The inverse document frequency tells how common or rare a word is across a collection of documents. It is a function of a word \(w\), and the collection of documents \(\mathcal{D}\).


\[ idf(w, \mathcal{D}) = \log\left(\frac{\text{size of } \mathcal{D}}{\text{number of documents that contain }w}\right) \]


If every document contains \(w\), then \(\log(1) = 0\).

Inverse Document Frequency

For the reviews data set, our collection is all the reviews. You could compute this in a somewhat roundabout way as follows:
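
A base-R sketch on three toy documents:

```r
# Three made-up "documents", each a vector of words.
docs <- list(
  d1 = c("fun", "game"),
  d2 = c("game", "island"),
  d3 = c("game")
)

# idf(w, D) = log(size of D / number of documents containing w).
idf <- function(w, docs) {
  n_containing <- sum(vapply(docs, function(d) w %in% d, logical(1)))
  log(length(docs) / n_containing)
}

idf("game", docs)  # in every document: log(3/3) = 0
idf("fun", docs)   # in one of three documents: log(3/1)
```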

All together: Term Frequency, Inverse Document Frequency

Multiply tf and idf together. This is a function of a word \(w\), a document \(d\), and the collection of documents \(\mathcal{D}\):


\[ tf\_idf(w, d, \mathcal{D}) = tf(w, d) \times idf(w,\mathcal{D}) \]


A high tf_idf value indicates that a word appears frequently in a specific document but is relatively rare across all documents.

Conversely, a low tf_idf value means the word occurs in many documents, causing the idf to approach zero and resulting in a small tf_idf.

tf_idf

We can use tidytext to compute those values:
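
For instance, bind_tf_idf() on made-up per-document counts:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy counts; "game" appears in both documents, so its idf (and tf_idf) is 0.
word_counts <- tibble(
  document = c("d1", "d1", "d2", "d2"),
  word     = c("fun", "game", "game", "island"),
  n        = c(2, 1, 3, 1)
)

# Adds tf, idf, and tf_idf columns from per-document counts.
word_counts %>% bind_tf_idf(term = word, document = document, n = n)
```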

What Words Were Important to (A Sample of) Users that Had Positive Reviews?

user_reviews_words %>%
  anti_join(stopwords_smart) %>%
  count(user_name, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = user_name, n = n) %>%
  arrange(user_name, desc(tf_idf)) %>%
  filter(user_name %in% c("Alucard0", "Cbabybear", "TheRealHighKing")) %>%
  group_by(user_name) %>%
  top_n(5) %>%
  mutate(rank = paste("Top", 1:n())) %>%
  ungroup() %>%
  mutate(word = interaction(rank, word, lex.order = TRUE, sep = " : ")) %>%
  mutate(word = `levels<-`(rev(word), rev(levels(word)))) %>%
  ggplot() +
  geom_col(aes(word, tf_idf)) +
  facet_wrap(~user_name, ncol = 1, scales = "free_y") +
  coord_flip()

Practice in Your Own Time

Text Mining with R has an example comparing historical physics textbooks:

Discourse on Floating Bodies by Galileo Galilei, Treatise on Light by Christiaan Huygens, Experiments with Alternate Currents of High Potential and High Frequency by Nikola Tesla, and Relativity: The Special and General Theory by Albert Einstein. All are available from Project Gutenberg.

Work your way through the comparison of physics books. It is section 3.4.

Resources

Thanks